ABSTRACT


In  this  tutorial,  we  discuss  several  practical  issues  regarding  specification  and  solution  of  dependability  and 
performability  models.  We  compare  model  types  with  and  without  rewards.  Continuous-time  Markov  chains 
(CTMCs)  are  compared  with  (continuous-time)  Markov  reward  models  (MRMs)  and  generalized  stochastic 
Petri  nets  (GSPNs)  are  compared  with  stochastic  reward  nets  (SRNs).  It  is  shown  that  reward-based  models 
could lead to more concise model specification and solution of a variety of new measures. With respect to the
solution of dependability and performability models, we identify three practical issues: largeness, stiffness,
and non-exponentiality, and we discuss a variety of approaches to deal with them, including some of the
latest research efforts.

1 Introduction


Dependability, performance, and performability evaluation techniques provide a useful method for understanding
the dynamic behavior of a computer or communication system. To be useful, the evaluation
should reflect important system characteristics such as fault-tolerance, automatic reconfiguration, and repair;
contention for resources; concurrency and synchronization; deadlines imposed on the tasks; and graceful
degradation.  Furthermore,  complexity  of  current-day  systems  and  corresponding  system  evaluation  should 
be  explicitly  addressed. 

Traditional performance evaluation is concerned with contention for system resources. Performance
evaluation of parallel and distributed systems also addresses concurrency and synchronization of tasks. Real-time
system performance evaluation takes into account various hard and soft deadlines on task execution
times.

Reliability, availability, safety, and related measures are collectively known as dependability. Dependability
evaluation encompasses fault-tolerance, reconfiguration, and repair aspects of system behavior. More
recently,  interest  in  combining  performance  and  dependability  evaluation  has  grown.  Such  performability 
evaluation  considers  the  graceful  degradation  of  the  system  in  addition  to  the  dependability  aspects. 

While  measurement  is  an  attractive  option  for  assessing  an  existing  system  or  a  prototype,  it  is  not  a 
feasible  option  during  the  system  design  and  implementation  phases.  Model-based  evaluation  has  proven  to 
be  an  attractive  alternative  in  these  cases.  A  model  is  an  abstraction  of  a  system  that  includes  sufficient 
detail  to  facilitate  an  understanding  of  system  behavior.  Several  types  of  models  are  currently  used  in 
practice.  The  most  appropriate  type  of  model  depends  upon  the  complexity  of  the  system,  the  questions  to 
be studied, the accuracy required, and the resources available for the study.

Discrete-event  simulation  is  the  most  commonly  used  modeling  technique  in  practice  but  it  tends  to  be 
relatively  expensive.  Analytical  modeling  provides  a  cost-effective  alternative  to  simulation  for  studying  the 
performance  and  dependability  of  computer  and  communication  systems.  Due  to  recent  developments  in 
model  generation  and  solution  techniques  and  automated  tools,  large  and  realistic  models  can  be  developed 
and  studied.  In  this  tutorial  we  concentrate  on  such  analytic  models.  The  rest  of  this  tutorial  is  organized  as 
follows.  In  the  next  section,  we  present  an  overview  of  various  approaches  to  dependability  and  performance 
modeling.  In  Section  3,  we  show  how  performability  analysis  can  be  carried  out  using  MRMs.  We  also 
show how dependability measures can be obtained via performability analysis using special reward rate
assignment.

In Section 4, we compare GSPNs and stochastic reward nets. In Section 5, we discuss in detail some practical
issues in solving dependability and performability models: largeness, stiffness, and non-exponentiality.

Figure 1: A reliability block diagram model.


2  Approaches  to  Modeling 

2.1  Dependability  Modeling 

Reliability  block  diagrams,  fault  trees,  and  reliability  graphs  are  commonly  used  to  study  the  dependability 
of  systems  [59].  Although  these  models  are  concise  and  have  efficient  solution  methods,  they  cannot  represent 
dependencies  among  components  [56]  as  easily  as  CTMC  models  can  [21,  23]. 

We begin by considering a fault-tolerant, multi-processor computer with multiple, shared memory modules.
The system is able to detect a processor or memory module failure and reconfigure itself to continue
operation  without  the  failed  component.  The  system  can  operate  with  just  one  processor  and  one  memory 
module. 

Our  first  model  of  this  system  is  the  reliability  block  diagram  in  Figure  1.  We  could  attach  to  each 
component  the  probability  of  having  failed  by  a  particular  time.  In  a  more  general  parameterization,  a 
failure  time  distribution  function,  rather  than  a  probability  value,  can  be  attached  to  each  component.  For 
example, one can assign the exponential distribution $F_p(t) = 1 - e^{-\lambda_p t}$ to processors and $F_m(t) = 1 - e^{-\lambda_m t}$
to memories. We can request the system failure time distribution as a function of the time variable $t$. For a
system with two processors and three memory modules,

$$F_{sys}(t) = 1 - \left(1 - (1 - e^{-\lambda_p t})^2\right)\left(1 - (1 - e^{-\lambda_m t})^3\right).$$

We can also ask for the mean time to system failure,

$$\mathrm{MTTF}_{sys} = \int_0^\infty \left(1 - F_{sys}(t)\right)\,dt.$$
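As a rough numerical check of the block-diagram expressions, the following Python sketch evaluates the system failure-time distribution for two processors and three memories and compares a closed-form mean time to failure (obtained by expanding the product of the two parallel-block reliabilities and integrating each exponential term) with a trapezoidal integration of the reliability. The function names and the rate values are illustrative assumptions, not taken from the tutorial.

```python
import math

def f_sys(t, lam_p, lam_m, n_p=2, n_m=3):
    """Failure-time CDF: n_p parallel processors in series with n_m parallel memories."""
    all_procs_failed = (1.0 - math.exp(-lam_p * t)) ** n_p
    all_mems_failed = (1.0 - math.exp(-lam_m * t)) ** n_m
    return 1.0 - (1.0 - all_procs_failed) * (1.0 - all_mems_failed)

def mttf_closed_form(a, b):
    """MTTF for 2 processors (rate a) and 3 memories (rate b).

    1 - F_sys(t) = (2e^{-at} - e^{-2at})(3e^{-bt} - 3e^{-2bt} + e^{-3bt});
    each product term e^{-ct} integrates to 1/c over [0, infinity).
    """
    return (6.0 / (a + b) - 6.0 / (a + 2 * b) + 2.0 / (a + 3 * b)
            - 3.0 / (2 * a + b) + 3.0 / (2 * a + 2 * b) - 1.0 / (2 * a + 3 * b))

def mttf_numeric(a, b, horizon, steps=200_000):
    """Trapezoidal integration of the reliability 1 - F_sys(t) over [0, horizon]."""
    h = horizon / steps
    total = 0.5 * ((1.0 - f_sys(0.0, a, b)) + (1.0 - f_sys(horizon, a, b)))
    for k in range(1, steps):
        total += 1.0 - f_sys(k * h, a, b)
    return total * h

if __name__ == "__main__":
    lam_p, lam_m = 1e-3, 2e-3  # assumed failure rates per hour, for illustration only
    print(mttf_closed_form(lam_p, lam_m), mttf_numeric(lam_p, lam_m, horizon=20_000.0))
```

The horizon only needs to be long enough for the slowest-decaying term, $e^{-(\lambda_p + \lambda_m)t}$, to become negligible.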

Now suppose we want to investigate a different computer design where the two processors have fast private
memory  modules  and  the  system  has  slower,  shared  memory  modules.  We  assume  that  the  system  operates 
as  long  as  there  is  at  least  one  operational  processor  with  access  to  either  a  private  or  shared  memory.  We 
cannot  model  this  system  with  a  block  diagram,  because  there  is  no  way  to  model  how  the  shared  memories 
are  connected  to  all  processors  while  private  memories  are  connected  to  particular  processors.  So,  we  turn 
to  a  fault  tree  model,  shown  for  two  processors  and  three  memory  modules  in  Figure  2.  We  could  also  use 
a  reliability  graph,  where  time-to-failure  distributions  are  assigned  to  the  edges.  The  system  is  operational 
as long as there is a path from source (src) to sink. In this particular model (Figure 3), processor failures
happen along the edges labeled P1 and P2 and memory failures happen along the edges M1, M2, and M3.
The edges I1 and I2 do not represent system components; they represent the structure of the system (the
sharing of M3). We assign the “infinite” distribution, defined by $I(t) = 0$, to them. There is a path from
source to sink if P1 and M1 are up or if P1 and M3 are up, and similarly for paths involving P2. Analysis
of the reliability graph results in the same failure time distribution as the fault tree analysis.

Figure 2: A fault tree model.

Figure 3: A reliability graph model.

Now we extend our models to take into account repair or replacement of parts. We calculate the “availability”
of the system, the (transient or steady-state) probability that the system is functioning. We examine
the all-shared-memory system and look at three repair strategies:

1. There are enough repair resources to repair all components at the same time, if necessary.

2.  There  are  two  repair  facilities,  one  for  processors  and  one  for  memory  modules,  each  able  to  handle 
one  component  at  a  time. 

3.  There  is  one  repair  facility,  able  to  handle  one  component  at  a  time.  Processor  repair  has  preemptive 
priority  over  memory  repair. 

For the first strategy, the states of the components (either up or down) are mutually independent, since
the failure and repair of each component does not depend on that of any other component. Because of this
independence, we can use the block diagram used to model reliability (Figure 1) to model availability as well.
Instead of assigning to each component the time-to-failure distribution, we use the transient unavailability. If
the $i$-th component has exponentially distributed failure behavior with rate $\lambda_i$ and repair is also exponentially
distributed with rate $\mu_i$, its unavailability at time $t$ is

$$U_i(t) = \frac{\lambda_i}{\lambda_i + \mu_i} - \frac{\lambda_i}{\lambda_i + \mu_i}\, e^{-(\lambda_i + \mu_i)t} \qquad (1)$$

and the steady-state unavailability is given by

$$\lim_{t \to \infty} U_i(t) = \frac{\lambda_i}{\lambda_i + \mu_i}.$$

These expressions can be derived by solving the two-state (up/down) CTMC for a component [62].

Figure 4: A CTMC model.

If we analyze the reliability block diagram of Figure 1 with the assignment of distribution functions of
Equation 1 to the components, the resulting function is the system unavailability at time $t$, $U_{sys}(t)$, and the
“mass at infinity”, $1 - \lim_{t \to \infty} U_{sys}(t)$, is the steady-state system availability.
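The first repair strategy can be sketched in a few lines of Python: each component gets the two-state transient unavailability (assuming it starts in the up state), and the block diagram combines them exactly as it combined failure-time distributions. The function names and all rate values below are assumptions chosen for illustration.

```python
import math

def unavailability(t, lam, mu):
    """Transient unavailability of a two-state component that is up at t = 0."""
    steady = lam / (lam + mu)
    return steady * (1.0 - math.exp(-(lam + mu) * t))

def system_unavailability(t, lam_p, mu_p, lam_m, mu_m, n_p=2, n_m=3):
    """Block diagram: n_p parallel processors in series with n_m parallel memories."""
    all_procs_down = unavailability(t, lam_p, mu_p) ** n_p
    all_mems_down = unavailability(t, lam_m, mu_m) ** n_m
    return 1.0 - (1.0 - all_procs_down) * (1.0 - all_mems_down)

if __name__ == "__main__":
    # Assumed rates (per hour); independent repair of every component.
    rates = dict(lam_p=0.01, mu_p=1.0, lam_m=0.02, mu_m=1.0)
    for t in (1.0, 10.0, 1e6):
        print(t, system_unavailability(t, **rates))
```

For large $t$ the result approaches the steady-state system unavailability, and one minus that value is the steady-state availability.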

To  deal  with  the  second  and  third  repair  strategies,  we  can  no  longer  use  the  block  diagram  model.  The 
block  diagram  assumes  that  all  components  are  statistically  independent,  but,  if  components  share  repair 
facilities,  the  failure  and  repair  behavior  of  one  component  is  dependent  on  the  state  of  all  components. 

If  the  failure  and  repair  distributions  are  exponential,  we  can  use  a  CTMC  model.  Consider  the  CTMC 
in  Figure  4.  State  mp  represents  the  system  when  m  memory  units  and  p  processors  are  functional.  The 
model with all of the solid and dashed-line transitions is for the second repair strategy (one repair facility for
processors  and  one  for  memories).  The  model  for  the  third  strategy  (only  one  repair  facility  giving  priority 
to  the  processors)  is  obtained  by  excluding  the  dashed  lines,  since  no  memory  is  repaired  while  there  are 
failed  processors. 

We  note  that  we  could  have  used  a  CTMC  for  the  first  repair  strategy  as  well.  We  would  have  assigned 
different  transition  rates  to  the  repair  transitions  to  reflect  the  fact  that  more  than  one  component  can  be 
repaired at a time. As an example, the rate for the transition from 02 to 12 would be $3\mu_m$ rather than
$\mu_m$. The block diagram model, though, is both easier to construct and more efficient to analyze.
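The CTMC for the second and third repair strategies can be generated and solved mechanically. The sketch below is one possible reading of the model, not a transcription of Figure 4: it assumes that each working component fails at its own rate (so the failure rate out of a state scales with the number of working units), that no component fails while the system is down, and that repairs are exponential. The function names, the `shared_facility` flag, and the rate values are hypothetical.

```python
from itertools import product

def build_generator(lam_p, lam_m, mu_p, mu_m, shared_facility, n_p=2, n_m=3):
    """Return (states, Q); state (m, p) = number of working memories, processors."""
    states = [(m, p) for m, p in product(range(n_m + 1), range(n_p + 1)) if (m, p) != (0, 0)]
    index = {s: k for k, s in enumerate(states)}
    Q = [[0.0] * len(states) for _ in states]

    def add(src, dst, rate):
        Q[index[src]][index[dst]] += rate
        Q[index[src]][index[src]] -= rate

    for (m, p) in states:
        if m > 0 and p > 0:              # assumed: components fail only while the system runs
            add((m, p), (m - 1, p), m * lam_m)
            add((m, p), (m, p - 1), p * lam_p)
        if p < n_p:                      # processor repair, one at a time
            add((m, p), (m, p + 1), mu_p)
        if m < n_m and (p == n_p or not shared_facility):
            add((m, p), (m + 1, p), mu_m)  # memory repair; blocked by failed processors if shared
    return states, Q

def steady_state(Q):
    """Solve pi Q = 0 with sum(pi) = 1 by Gauss-Jordan elimination."""
    n = len(Q)
    A = [[Q[i][j] for i in range(n)] + [0.0] for j in range(n)]  # one balance equation per column
    A[-1] = [1.0] * n + [1.0]                                    # replace one with normalization
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [x - factor * y for x, y in zip(A[r], A[col])]
    return [A[k][n] / A[k][k] for k in range(n)]

def steady_state_unavailability(states, pi):
    """System is down when no memory or no processor is working."""
    return sum(prob for (m, p), prob in zip(states, pi) if m == 0 or p == 0)

if __name__ == "__main__":
    rates = dict(lam_p=1e-3, lam_m=2e-3, mu_p=0.5, mu_m=0.5)  # illustrative values only
    for shared in (False, True):  # False: second strategy, True: third strategy
        states, Q = build_generator(shared_facility=shared, **rates)
        print(shared, steady_state_unavailability(states, steady_state(Q)))
```

Dense elimination is adequate for an 11-state chain; the largeness issues discussed in Section 5 are what make this approach impractical for realistic models.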
Figure 5: A GSPN model.

Before  leaving  the  subject  of  unavailability,  we  illustrate  the  use  of  one  more  model  type,  the  GSPN.  For 
a  discussion  of  this  model  type,  the  reader  is  referred  to  [1].  Modeling  the  availability  of  this  system  with  a 
GSPN  does  more  than  just  give  us  another  validity  check.  It  allows  us  to  find  the  unavailability  for  a  system 
with  any  number  of  processors  and  memories  without  having  to  construct  a  separate  model  for  each  number 
of  components.  The  GSPN  in  Figure  5  is  a  model  of  the  system  in  which  there  is  one  repair  facility  to  be 
shared  for  all  components. 

There is a token for each processor and each memory. Initially, there are $n_p$ tokens in the place ppup (place:
processors up) and $n_m$ tokens in the place pmup (place: memories up). When a processor fails, its token
moves from place ppup through transition tpfail (transition: processor fails) to place pprep (place: processor
waiting for repair). Processor repair is represented by a token moving from place pprep through transition
tprep to place ppup. The inhibitor arcs from pprep to tmfail and pmrep to tpfail reflect the assumption that
if the system has already failed because all processors or all memories have failed, the remaining working
components do not fail while they are not running. This aspect of the system was modeled only implicitly in
the CTMC model, by the absence of failure transitions from the states with either no operating processors
or no operating memory modules. The inhibitor arc from pprep to tmrep is the one that represents our
assumption  that  there  is  only  one  repair  facility;  if  there  are  any  failed  processors,  there  can  be  no  memory 
repair. 

We can verify that analyzing this GSPN with $n_p = 2$ and $n_m = 3$ gives the same result for system steady-state
unavailability as the CTMC model. We note that the GSPN, although a more efficient specification,
is  no  more  efficient  to  analyze  than  the  CTMC,  since  analysis  of  a  GSPN  involves  translating  the  GSPN 
into  a  CTMC.  However,  dependability  modeling  with  GSPN  tends  to  be  clumsy  [41].  Stochastic  reward  nets 
remove  this  restriction  from  GSPN  models.  We  elaborate  more  on  this  in  Section  4. 

2.2  System  Performance  Models 

In  this  section,  we  look  at  aspects  of  system  performance,  including  performance  of  gracefully  degraded 
systems. In the performance domain, task precedence graphs [31, 55] can be used to model the performance
of concurrent programs with unlimited resources. Product form queueing networks [35, 36], on the
other hand, can represent contention for resources. However, they cannot model concurrency within a job,
synchronization, or server failures, since these violate the product form assumptions.

Figure 6: A product form queueing network for the system with three shared memories.

Figure 7: A product form queueing network for the system with one shared memory and two local memories.

We  consider  the  same  two  system  architectures  as  in  Section  2.1:  the  first  containing  two  processors 
and  three  shared  memory  modules  and  the  second  containing  two  processors,  each  with  a  private  memory 
module,  and  one  shared  memory  module. 

To  capture  the  effects  of  contention  for  the  processor  and  memory  resources,  we  use  queueing  network 
models.  We  assume  that  the  memory  modules  are  servers  in  the  sense  that  they  queue  requests  and  perform 
block transfers. To set up a realistic queueing model, we would have to take into account the proposed operating
system design, especially the scheduling aspects, and we would need some kind of expected workload
characterization.  For  the  sake  of  illustration,  we  use  the  closed  queueing  network  models  shown  in  Figures 
6  and  7. 

The network in Figure 6 is for the design containing two processors and three shared memory modules.
We model the two processors by a multiple-server station. That is, jobs wait in a single queue and enter
whichever server becomes free. When a job wants to access the memory, it requires memory module $M_i$ with
probability $pr_i$. After some visits to the processor, a job finishes; $pr_0$ is the probability that a job is finished
when it leaves one of the processors. As is usual for closed queueing networks, the assumption is that each
finished job is replaced by a statistically identical new job.

The  network  in  Figure  7  is  for  the  design  containing  two  private  memory  modules.  For  this  system,  we 
assume  that  jobs  are  targeted  to  particular  processors.  This  is  reasonable,  since,  once  a  job  starts  on  a 
processor,  we  want  it  to  continue  where  it  has  access  to  that  processor’s  private  memory.  We  carry  out  this 
assumption  by  making  the  queueing  network  a  “multiple-chain”  queueing  network,  in  this  case  having  two 
“chains”, or classes of jobs. Jobs in the first class go from P1 to either M1 or M3 and back to P1 and jobs
in the second class go from P2 to either M2 or M3 and back to P2.

As expected, the system with private memories provides higher system throughput than the
shared-memory system.

To model the systems when one memory has failed, we remove the server M1 (and its queue) from each
of the models and adjust the probabilities $pr_i$ and $pr_{ij}$ appropriately.

Queueing  models  are  able  to  capture  the  effects  of  resource  contention,  but  measures  related  to  the  total 
number  of  jobs  serviced  do  not  capture  the  performance  of  the  system  as  seen  by  a  single  parallel  program: 
series-parallel  acyclic  graph  models  [55]  can  be  used  for  this  purpose. 

CTMCs also provide a useful framework to model system performance, but a detailed CTMC model
is  often  large  and  complex  and  its  construction  is  an  error-prone  process.  Hence  there  is  a  need  for  a 
higher-level  model-type  having  an  underlying  CTMC,  which  is  then  automatically  generated  from  it.  Some 
attempts  in  the  specific  instance  of  dependability  modeling  have  resulted  in  useful  packages  like  SAVE  [23], 
for  availability  modeling,  which  uses  a  block  diagram  input,  and  HARP  [21],  for  reliability  modeling,  which 
uses  a  fault-tree  input.  A  suitable  interface  is  necessary  for  a  more  general  modeling  environment.  GSPNs 
[1]  and  SRNs  [13]  provide  an  excellent  interface  for  detailed  performance  modeling  of  complex  systems. 

The  advent  of  fault-tolerant  computing  has  resulted  in  the  design  of  machines  which  continue  to  function 
even  in  the  presence  of  failures,  albeit  at  a  reduced  level  of  performance.  Pure  reliability  or  performance 
models  of  such  systems  do  not  capture  the  whole  picture.  This  has  prompted  researchers  to  consider  the 
combined  evaluation  of  performance  and  reliability  [44,  63].  The  CTMC  is  extended  by  associating  rewards 
with  its  states  to  obtain  a  “Markov  reward  process”,  or  “Markov  reward  model”  (MRM).  This  process  not 
only  facilitates  modeling  of  performance  and  reliability  but  also  the  combined  evaluation  of  performance  and 
reliability.  Since  this  paper  considers  the  automatic  generation  of  the  CTMC  from  the  GSPN  description  of 
the  model,  the  reward  structure  must  also  be  defined  in  terms  of  the  GSPN  entities.  Consequently  the  GSPN 
description is modified to obtain “stochastic reward nets” [13] which can be automatically transformed to
obtain  the  underlying  MRM. 

3 CTMCs versus MRMs


CTMCs have been traditionally used to model dependability. MRMs [25] are CTMCs in which reward rates
may be associated with states of the CTMC (rate-type rewards) or with transitions of the CTMC (impulse-type
rewards) or both. We consider MRMs with rate-type rewards. MRMs have been successfully used for
performability analysis [44, 63] according to the following methodology. Initially, a dependability model (also
known as structural model) of the system is constructed. Assuming the dependability model is of state-space
type (such as a CTMC), a performance measure is obtained (possibly by solving a performance model) for
each state of the dependability model. This performance measure becomes the reward rate of that state
in the dependability model. With the reward-rate assignment, the dependability model becomes an MRM
which may then be solved for various performability measures. There is an approximation involved in this
decomposition of performance and dependability models: the system is assumed to have attained (quasi-)steady-state
in each state of the dependability model, so that the reward rate for each state of the reliability
model is a steady-state performance measure. Transient or steady-state analysis of the dependability model
with  rewards  is  then  carried  out.  The  justification  for  this  decomposition  lies  in  the  fact  that  the  performance 
activities  are  much  faster  than  the  dependability  events. 

CTMCs  can  also  be  used  for  performability  analysis  if  a  monolithic  model  is  constructed  which  combines 
both  the  dependability  and  performance  model  of  the  system.  However,  the  state-space  of  this  model  is 
approximately the cross-product of state-spaces of the dependability and performance models. In addition,
this  monolithic  model  is  stiff  because  of  extreme  disparity  between  the  transition  rates  (job  arrival  rates 
could  be  10^  times  or  more  than  the  fault  occurrence  rates).  One  may  argue  that  this  approach  is  more 
accurate  than  the  MRM  approach  since  no  approximation  is  involved.  However,  this  gain  in  accuracy  may 
well  be  negated  due  to  the  computational  problems  posed  by  largeness  and  stiffness  of  the  monolithic  model. 
We  focus  more  on  these  two  problems,  largeness  and  stiffness,  in  later  sections.  The  MRM  approach  has 
another significant advantage. No assumptions are made about how the reward rates are obtained. The
reward rates may be obtained by simulation, by solving a queueing network, or by solving a semi-Markov
process  (SMP),  etc. 

It  is  easy  to  see  that  CTMCs  are  special  cases  of  MRMs  and  therefore  dependability  analysis  becomes  a 
special  case  of  performability  analysis.  In  this  section,  we  briefly  show  how  various  dependability  measures 
can  be  analyzed  as  performability  measures  when  the  MRM  has  a  special  reward-rate  assignment.  Let 
$\{\Theta(t), t \ge 0\}$ be an MRM with state space $\Psi$ and constant reward rate $r_i$ associated with each state $i$ of
the CTMC. If the MRM spends $\tau_i$ units of time in state $i$, then $r_i \tau_i$ is the reward accumulated during this
sojourn. Let $Q$ be the generator matrix and $P(t)$ be the state probability vector of the MRM. Here $P_i(t)$
denotes the transient probability of the MRM being in state $i$ at time $t$. The transient behavior of this MRM
is given by the Kolmogorov differential equation:

$$\frac{dP(t)}{dt} = P(t)\,Q \qquad (2)$$

given the initial state probability vector $P(0)$. The steady-state probability vector $\pi$ (assuming that it exists
and is unique) is obtained by setting the l.h.s. in Equation 2 to zero:

$$\pi Q = 0, \qquad \sum_{i \in \Psi} \pi_i = 1.$$
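To make the Kolmogorov equation concrete, the sketch below integrates $dP/dt = PQ$ for a two-state (up/down) chain and compares the result with the known closed form for that chain. Classical fourth-order Runge-Kutta is used here only because it is short; it is an illustrative choice, not a method the tutorial prescribes, and it is a poor fit for stiff models. The rates are assumed values.

```python
import math

def rk4_transient(Q, p0, t_end, steps=10_000):
    """Integrate dP/dt = P Q from 0 to t_end; P is a row vector, Q a list of rows."""
    n = len(p0)

    def deriv(p):
        return [sum(p[i] * Q[i][j] for i in range(n)) for j in range(n)]

    h = t_end / steps
    p = list(p0)
    for _ in range(steps):
        k1 = deriv(p)
        k2 = deriv([p[j] + 0.5 * h * k1[j] for j in range(n)])
        k3 = deriv([p[j] + 0.5 * h * k2[j] for j in range(n)])
        k4 = deriv([p[j] + h * k3[j] for j in range(n)])
        p = [p[j] + h / 6.0 * (k1[j] + 2.0 * k2[j] + 2.0 * k3[j] + k4[j]) for j in range(n)]
    return p

if __name__ == "__main__":
    lam, mu = 0.1, 1.0                      # assumed failure and repair rates
    Q = [[-lam, lam], [mu, -mu]]            # state 0 = up, state 1 = down
    p_up, p_down = rk4_transient(Q, [1.0, 0.0], t_end=2.0)
    closed = lam / (lam + mu) * (1.0 - math.exp(-(lam + mu) * 2.0))
    print(p_down, closed)
```

Running the integration to a large time approximates the steady-state vector, since the derivative vanishes exactly where $\pi Q = 0$.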


The cumulative state probability vector of the MRM is defined as $L(t) = \int_0^t P(x)\,dx$, where $L_i(t)$ denotes the
expected total time spent by the MRM in state $i$ during the interval $[0, t)$. To compute $L(t)$, we integrate
Equation 2:

$$\frac{dL(t)}{dt} = L(t)\,Q + P(0).$$

The reward rate at time $t$ for the MRM is given by $\Upsilon(t) = r_{\Theta(t)}$. The accumulated reward over the
interval $[0, t)$ is given by:

$$\Phi(t) = \int_0^t \Upsilon(x)\,dx = \int_0^t r_{\Theta(x)}\,dx.$$

The expected reward rate at time $t$ of the MRM is:

$$E[\Upsilon(t)] = \sum_{i \in \Psi} r_i P_i(t).$$

The expected reward rate in steady-state for the MRM is:

$$E[\Upsilon_{ss}] = \sum_{i \in \Psi} r_i \pi_i.$$

To compute availability measures, the state-space of the MRM is partitioned into two: a set of system-up
states, with reward rate 1, and a set of system-down states, with reward rate 0. We term this a 0-1 reward
assignment. The transient availability of the system is given by $E[\Upsilon(t)]$ and steady-state availability is given
by $E[\Upsilon_{ss}]$.
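The role of the reward assignment can be illustrated with a small example in which the same steady-state probabilities yield either availability or a performability measure, depending only on the reward vector. The model below is an assumption made for the sketch: state k counts working units out of n, each working unit fails at rate `lam`, and a single repair facility restores one unit at a time at rate `mu`.

```python
def birth_death_steady_state(n, lam, mu):
    """Steady-state probabilities pi[k], k = number of working units (0..n)."""
    # Balance across the cut between k-1 and k: pi[k] * k * lam = pi[k-1] * mu
    weights = [1.0]
    for k in range(1, n + 1):
        weights.append(weights[-1] * mu / (k * lam))
    total = sum(weights)
    return [w / total for w in weights]

def expected_reward(pi, rewards):
    """Expected steady-state reward rate: sum over states of r_i * pi_i."""
    return sum(r * p for r, p in zip(rewards, pi))

if __name__ == "__main__":
    pi = birth_death_steady_state(n=2, lam=0.01, mu=1.0)     # assumed rates
    availability = expected_reward(pi, [0, 1, 1])            # 0-1 reward assignment
    capacity = expected_reward(pi, [0, 1, 2])                # reward = working units
    print(availability, capacity)
```

Replacing the capacity rewards with throughputs obtained from a separate performance model for each state gives the decomposition described at the start of this section.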

The expected accumulated reward over the interval $[0, t)$ is:

$$E[\Phi(t)] = \sum_{i \in \Psi} r_i L_i(t).$$

The expected time-averaged reward rate over the interval $[0, t)$ is given by $\frac{1}{t} E[\Phi(t)]$. For an availability
model with 0-1 reward assignment, the total uptime of the system over the interval $[0, t)$ is $E[\Phi(t)]$. Interval
availability is the proportion of time a system is up in a given interval of time and it is given by $E[\Phi(t)]/t$
for the interval $[0, t)$.

For MRMs with absorbing states, the state-space $\Psi$ is partitioned into two: $\Psi_A$ (set of absorbing states)
and $\Psi_T$ (set of transient states). Let $Q_T$ be the submatrix of $Q$ corresponding to the transitions between
transient states. The mean time spent by the MRM in state $i \in \Psi_T$ before absorption is given by
$\tau_i = \int_0^\infty P_i(x)\,dx$, which is obtained by integrating Equation 2 from 0 to $\infty$:

$$\tau Q_T + P_T(0) = 0.$$

The mean time to absorption is given by:

$$\mathrm{MTTA} = \sum_{i \in \Psi_T} \tau_i.$$

To compute reliability measures, all the system-down states are made absorbing states (transitions leaving
from them are deleted). The same 0-1 reward assignment is used. The reliability is given by $E[\Upsilon(t)]$.
The lifetime (similar to total uptime) [20] of the system over the interval $[0, t)$ is $E[\Phi(t)]$. The expected
accumulated reward until absorption is:

$$E[\Phi(\infty)] = \sum_{i \in \Psi_T} r_i \tau_i,$$

and the mean time to failure (MTTF) of the system is $E[\Phi(\infty)]$.
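As a small worked instance of the absorption equations, the sketch below computes the mean time to failure of a two-unit parallel system with repair by solving the linear system for the mean times spent in the transient states. The model (both units fail at rate `lam`, one unit is repaired at rate `mu`, and failure of both units is absorbing) and the helper names are assumptions for illustration; the result can be compared with the textbook value $(3\lambda + \mu)/(2\lambda^2)$ for this system.

```python
def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[k][n] / M[k][k] for k in range(n)]

def mean_times_before_absorption(Q_T, p0_T):
    """Solve tau Q_T = -P_T(0), written as Q_T^T tau^T = -P_T(0)^T."""
    n = len(Q_T)
    transposed = [[Q_T[i][j] for i in range(n)] for j in range(n)]
    return solve_linear(transposed, [-x for x in p0_T])

if __name__ == "__main__":
    lam, mu = 1e-3, 0.1                       # assumed rates
    # Transient states: index 0 = two units up, index 1 = one unit up.
    Q_T = [[-2 * lam, 2 * lam],
           [mu, -(lam + mu)]]
    tau = mean_times_before_absorption(Q_T, [1.0, 0.0])
    rewards = [1, 1]                          # 0-1 assignment: both transient states are up
    mttf = sum(r * t for r, t in zip(rewards, tau))
    print(mttf, (3 * lam + mu) / (2 * lam ** 2))
```

With rewards other than 1 on the transient states, the same sum gives the expected reward accumulated until absorption.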

The distribution of the reward rate at time $t$, $\Upsilon(t)$, is computed as:

$$P[\Upsilon(t) \le x] = \sum_{i \in \Psi,\; r_i \le x} P_i(t).$$

The distribution of accumulated reward until absorption or a finite period can also be computed. If the
time to accumulate a given reward $r$ is $\Gamma(r)$, then the distribution of $\Gamma(r)$ is known once the distribution of
accumulated reward is known [32]:

$$P[\Gamma(r) \le t] = 1 - P[\Phi(t) < r]. \qquad (3)$$

For  instance,  the  distribution  of  time  to  complete  a  job  that  requires  r  units  of  processing  time  on  a  system 
which  is  modeled  by  an  MRM  can  be  computed  in  this  fashion. 

From  the  above  discussion,  it  is  clear  that  dependability  analysis  can  be  carried  out  using  MRMs  with 
special  reward  rate  assignment  to  various  system  states.  This  analysis  can  also  be  carried  out  using  CTMCs 
(without  rewards)  in  an  equally  efficient  manner.  However,  performability  analysis,  which  can  be  easily 
carried  out  using  MRMs,  becomes  cumbersome  if  rewards  are  not  used. 

4  SPNs  versus  SRNs 

CTMCs modeling real systems tend to be large, sometimes with hundreds of thousands of states. A higher-level
specification mechanism is thus needed for the concise description of the model and the automatic
conversion  into  a  CTMC.  Stochastic  Petri  nets  (SPNs)  provide  such  a  mechanism.  Molloy  [48]  used  SPNs 
for  performance  analysis  and  showed  that  they  are  isomorphic  to  CTMCs.  Since  then,  several  extensions 
have  been  made  to  SPNs.  Some  of  these  extensions  have  enhanced  the  flexibility  of  use  and  allowed  for  even 
more  concise  description  of  performance  and  reliability  models.  Some  other  extensions  have  enhanced  the 
modeling  power  by  allowing  for  non-exponential  distributions  (see  Section  5.3). 

In  this  section,  we  compare  SPNs  with  and  without  rewards.  Specifically,  we  compare  SRNs  as  defined 
by Ciardo et al. [13] and GSPNs as defined by Ajmone-Marsan et al. [2]. SRNs are an extension of GSPNs,
since they include all the features of GSPNs and add more features. There are several structural extensions
such as guards (earlier known as enabling functions), priorities with timed transitions, marking-dependent
arc cardinalities, and halting condition. Besides the structural extensions, a reward rate function associates
a reward rate with each reachable marking. GSPNs and SRNs have been shown to be isomorphic to CTMCs
and MRMs respectively. However, we show in this section that SRNs allow a much more concise description
of system behavior than GSPNs. This is particularly true for dependability models. Furthermore, certain
reward-based measures as described in Section 3 can be computed using SRNs but cannot be computed
using GSPNs.

Figure 8: A simple network

To  compare  GSPNs  and  SRNs,  we  present  an  example.  Consider  a  simple  network  between  src  and 
sink  nodes  consisting  of  three  links  (Figure  8).  The  network  is  operational  as  long  as  link  A  and  at  least 
one  of  the  links  B  or  C  is  operational.  Assuming  that  each  link  has  its  independent  repairperson,  the 
availability  of  the  network  can  be  modeled  by  the  GSPN  shown  in  Figure  9.  A  token  in  places  pA,  pB, 
and  pC  respectively  indicates  that  links  A,  B,  and  C  are  operational.  A  token  in  place  pF  implies  that 
the  network  is  failed.  A  token  in  place  pR  implies  that,  due  to  repair  of  one  or  more  links,  the  component 
is  ready  to  be  operational  again.  The  firing  of  transition  tR  removes  the  token  from  pF,  signifying  that 
the  network  is  operational.  The  steady-state  (transient)  probability  of  a  token  being  in  place  pF  gives  the 
steady-state  (transient)  unavailability  of  the  network. 

The  availability  of  this  network  can  also  be  modeled  by  an  SRN  as  shown  in  Figure  10.  The  reward  rate 
function  is  as  shown  in  the  table.  The  expected  value  of  reward  rate  r  in  steady-state  (or  at  time  t)  gives 
the  steady-state  (transient)  availability  of  the  network.  Let  us  now  compare  the  GSPN  and  SRN  models.  A 
GSPN  model  requires  a  mesh  of  immediate  transitions,  places,  and  inhibitor  arcs  to  capture  the  operational 
dependence of the network on the links. Part of this mesh captures dependencies such as the fact that the subsystem
of links B and C fails only when both B and C have failed. The other part of the mesh captures the impact
of repairs of links, which reflect complementary conditions, such as removal of a token from place pBC as
soon as either B or C is repaired. As the systems grow in complexity, this mesh becomes very complex
and unwieldy. On the other hand, an SRN captures the operational dependence of the network on links through
the reward rate function. This results in a simpler and more manageable net.
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Figure  9:  GSPN  availability  model  of  the  network 


Name       Boolean Function
bool_A     (#tokens(pA) == 1)
bool_BC    (#tokens(pB) == 1) ∨ (#tokens(pC) == 1)
bool_nw    bool_A ∧ bool_BC

Reward Rate Function
if (bool_nw == 1) then r = 1 else r = 0


Figure  10:  SRN  availability  model  of  the  network 
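
The reward-based view of the three-link network can be illustrated with a small numerical sketch. It is not taken from the original figures: the failure and repair rates are hypothetical, and the product-form marking probabilities rely on the stated assumption that each link has its own repairperson (so the links behave independently). The expected steady-state reward rate is then the steady-state availability.

```python
from itertools import product

# Hypothetical failure and repair rates (per hour) for links A, B and C.
FAIL = {"A": 0.001, "B": 0.002, "C": 0.002}
REPAIR = {"A": 0.1, "B": 0.1, "C": 0.1}

def link_up_prob(name):
    """Steady-state probability that one independently repaired link is up."""
    return REPAIR[name] / (FAIL[name] + REPAIR[name])

def reward(tokens):
    """Reward rate of a marking: 1 if the network is operational, else 0."""
    bool_a = tokens["A"] == 1
    bool_bc = tokens["B"] == 1 or tokens["C"] == 1
    return 1 if (bool_a and bool_bc) else 0

def steady_state_availability():
    """Expected steady-state reward rate, summed over the 8 markings."""
    total = 0.0
    for marking in product((0, 1), repeat=3):
        tokens = dict(zip("ABC", marking))
        prob = 1.0
        for name, up in tokens.items():
            p = link_up_prob(name)
            prob *= p if up else 1.0 - p
        total += prob * reward(tokens)
    return total

availability = steady_state_availability()
```

Changing the operational condition (for example, requiring both B and C) only changes the body of `reward`; the enumeration of markings is untouched, which is the conciseness argument made for SRNs above.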
5  Computational  Problems


In  modeling  practice,  it  is  often  the  case  that  no  single  model  is  adequate  to  solve  a  problem.  Different  parts 
or  levels  of  detail  in  a  system  may  require  different  modeling  techniques.  In  cases  where  a  single  model  type 
can  be  used,  it  may  be  too  large  (a  problem  both  for  specification  and  analysis)  or  intractable  (“stiff”  or 
ill-conditioned).  Three  main  difficulties  in  analytic  models  include  largeness,  stiffness,  and  the  need  to  model
non-exponential  distributions.  We  explore  these  topics  in  the  following  subsections. 

5.1  Largeness 

The  problem  of  model  largeness  can  be  handled  in  two  ways:  it  can  be  avoided  or  it  can  be  tolerated. 

5.1.1  Largeness  Tolerance 

For  the  sake  of  simplicity  we  assume  that  the  underlying  model  is  a  CTMC  or  an  MRM.  If  we  are  prepared 
to  store  and  solve  the  matrix  of  a  large  model,  we  should  start  with  a  concise  description  of  the  system 
model  and  provide  for  the  automated  generation  and  the  solution  of  the  underlying  state  space.  A  number 
of  approaches  have  evolved  for  such  specifications.  Haverkort  and  Trivedi  [24]  summarize  these  approaches. 
They present seven different classes of specification techniques: stochastic Petri nets and their variants,
communicating processes, queueing networks, specialized languages, fault trees, production rule systems,
and hybrid techniques. We refer the reader to the cited paper for further details.

5.1.2  Largeness  Avoidance

If  the  size  of  the  underlying  CTMC  (or  MRM)  is  so  large  as  to  preclude  generation  and  storage,  we  must 
resort  to  approximations  that  avoid  the  large  underlying  model.  State  truncation,  lumping,  decomposition 
and  fluid  models  constitute  the  types  of  approximations  that  have  been  utilized.  We  discuss  these  four 
approaches  below. 

Truncation  For  many  practical  systems,  the  exact  number  of  structural  states  in  a  corresponding  model 
might  be  extremely  large,  or  even  infinite.  State-space  based  approaches,  then,  cannot  be  applied  directly
to  the  model.  In  many  cases,  though,  the  system  spends  most  of  the  time  in  a  small  subset  of  the  entire 
state  space;  most  states  have  an  extremely  small  probability. 

This  is  particularly  true  of  highly  reliable  systems:  if  a  system  has  K  components,  and  if  each  component 
fails  with  a  very  small  rate  (as  is  normally  the  case),  states  with  more  than  a  handful  of  failed  components 
are  rarely  reached.  Indeed,  it  is  common  practice  in  reliability  modeling  to  stop  the  state-space  exploration 
after $k \ll K$ failures, with the implicit assumption that states with $k + 1$ or more failed components have
negligible  probability.  This  is  just  one  example  of  state  truncation. 

As an example, consider a $K$-processor system, where nodes fail and are repaired with rates $\lambda$ and
$\mu$, respectively. We want to compute the expected cumulative computational capacity during the time
interval $[0, t)$, $C(t)$, that is, the expected number of non-failed processors as a function of time,
integrated between 0 and $t$. If the state is characterized by the number of working processors, the model
corresponds to a birth-death process with state space $\{K, K-1, \ldots, 0\}$ (Figure 11). If the processors
have different failure and repair behaviors, the identity of the failed processors must be recorded in the
state and the size of the state space grows dramatically, from $K + 1$ to $2^K$.

Figure 11: A typical model that can be truncated.

Figure 12: Strict truncation.

Formally, given a reachability graph $(S, A)$, a state truncation results in a truncated reachability graph
$(S', A')$.

If $(S', A')$ is a subgraph of $(S, A)$, the exact state-space exploration algorithm, or the model, is simply
modified to ignore certain arcs which lead to states in $S \setminus S'$. In our example, we can prevent a
$(k+1)$-th failure in a state which already has $k$ failed components. We call this case “strict truncation”
(Figure 12).

Alternatively, $(S', A')$ might be composed of a subgraph of $(S, A)$, augmented with one or more states
and arcs. In our example, we might add a new state $u$ (for unknown), and an arc from each state with $k$
failed components to $u$, corresponding to further failures of the non-failed components. Strictly speaking,
this is more an “aggregation”, so we call this approach an “aggregation truncation” (Figure 13).

The two approaches often allow us to obtain upper and lower bounds on the measure of interest. In our
example, we can solve the two CTMCs of Figures 12 and 13, obtaining two transient probability vectors:

\[ \pi^s(t) = [\pi^s_K(t), \ldots, \pi^s_{K-k}(t)] \]

and

\[ \pi^a(t) = [\pi^a_K(t), \ldots, \pi^a_{K-k}(t), \pi^a_u(t)] \]

respectively, where the superscripts $s$ and $a$ refer to strict and aggregation truncation. If we associate
the reward rates

\[ \rho^s = [\rho^s_K = K, \ldots, \rho^s_{K-k} = K-k] \]

and

\[ \rho^a = [\rho^a_K = K, \ldots, \rho^a_{K-k} = K-k, \rho^a_u = 0] \]

with the states of the two CTMCs, we obtain the inequalities

\[ C^s(t) = \sum_{i \in \{K,\ldots,K-k\}} \rho^s_i \int_0^t \pi^s_i(x)\,dx \;\ge\; C(t) \;\ge\; \sum_{i \in \{K,\ldots,K-k,u\}} \rho^a_i \int_0^t \pi^a_i(x)\,dx = C^a(t) . \]

Figure 13: Aggregation truncation.

If we are interested in the expected instantaneous computational capacity in steady state, $c$, that is, the
expected number of non-failed processors in the long run, the CTMC in Figure 12 still offers an upper bound,
but the one in Figure 13 is of no use, since state $u$ has probability one in steady state, which would simply
result in the trivial lower bound 0 for $c$. In any case, our ability to obtain useful bounds is normally tied to
our a priori knowledge of aspects of the CTMC structure and values of the reward rates. In our example,
we can prove that $C^s(t)$, computed from the strictly truncated CTMC of Figure 12, is an upper bound on
$C(t)$ because we know that

•  Removing the set of states $\{K-k-1, \ldots, 0\}$ does not decrease the probability of any of the states in
$\{K, \ldots, K-k\}$.

•  The maximum reward rate of the states in $\{K-k-1, \ldots, 0\}$ is not larger than the minimum reward
rate of the states in $\{K, \ldots, K-k\}$.

and we can prove that $C^a(t)$, computed from the aggregated CTMC of Figure 13, is a lower bound on $C(t)$
because we know that

•  Aggregating the set of states $\{K-k-1, \ldots, 0\}$ into a single absorbing state $u$ does not increase the
probability of any of the states in $\{K, \ldots, K-k\}$.

•  The minimum reward rate of the states in $\{K-k-1, \ldots, 0\}$ is not smaller than 0, the reward rate of
state $u$.

For  steady  state  analysis,  more  sophisticated  arguments  based  on  [17]  can  be  used  [49].  We  conclude  by 
observing  that  simulation  is,  in  a  probabilistic  sense,  a  form  of  automatic  truncation,  since  the  most  likely 
states are visited frequently while unlikely states may not be visited at all.
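
The bounding argument for the K-processor example can be checked numerically with a minimal sketch. Several things here are assumptions rather than part of the original text: every failed processor is repaired independently (so the repair rate with f failed processors is f times mu, which makes a closed form for the exact C(t) available), the parameter values are hypothetical, and uniformization is used as the transient solver simply because it is easy to write in plain Python. States are indexed by the number of failed processors.

```python
import math

def cumulative_reward(rates, rewards, start, t, eps=1e-10):
    """Expected reward accumulated over [0, t) for a CTMC, by uniformization.

    rates[i] is a dict {j: rate of i -> j}; rewards[i] is the reward rate of i.
    """
    n = len(rates)
    q = 1.02 * max(sum(row.values()) for row in rates)  # uniformization rate

    def step(v):
        # One step of the uniformized DTMC: v -> v (I + Q/q).
        w = [0.0] * n
        for i, vi in enumerate(v):
            w[i] += vi * (1.0 - sum(rates[i].values()) / q)
            for j, r in rates[i].items():
                w[j] += vi * r / q
        return w

    v = [0.0] * n
    v[start] = 1.0
    pmf = math.exp(-q * t)      # P[N = 0] for N ~ Poisson(q t)
    cdf, total, m = pmf, 0.0, 0
    while 1.0 - cdf > eps:
        # Weight of the m-th DTMC step in the time integral is P[N > m] / q.
        total += (1.0 - cdf) / q * sum(x * r for x, r in zip(v, rewards))
        v = step(v)
        m += 1
        pmf *= q * t / m
        cdf += pmf
    return total

def truncated_models(K, k, lam, mu):
    """Strictly truncated and aggregated CTMCs, indexed by failed processors."""
    strict = [dict() for _ in range(k + 1)]
    for f in range(k + 1):
        if f < k:
            strict[f][f + 1] = (K - f) * lam   # one more failure
        if f > 0:
            strict[f][f - 1] = f * mu          # one repair (assumed independent)
    aggregated = [dict(row) for row in strict] + [dict()]
    aggregated[k][k + 1] = (K - k) * lam       # arc into absorbing state u
    rewards = [K - f for f in range(k + 1)]
    return strict, rewards, aggregated, rewards + [0]

def exact_capacity(K, lam, mu, t):
    """Closed form for C(t) when all K processors are independent and start up."""
    s = lam + mu
    return K * (mu * t / s + lam / s ** 2 * (1.0 - math.exp(-s * t)))

strict, rs, aggregated, ra = truncated_models(K=8, k=2, lam=0.01, mu=1.0)
upper = cumulative_reward(strict, rs, start=0, t=100.0)
lower = cumulative_reward(aggregated, ra, start=0, t=100.0)
exact = exact_capacity(8, 0.01, 1.0, 100.0)
```

With these values the two truncated chains have only three and four states, yet they bracket the exact result; increasing `k` tightens the bracket at the cost of a larger state space.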

Lumping  Most  complex  systems  (models)  consist  of  a  large  set  of  subsystems  (submodels),  many  of  them  of
the  same  type.  The  state  of  the  system  is  then  obtained  by  composing  the  state  of  each  subsystem.  When 
performing  state-space  exploration,  though,  there  are  simplifications  which  might  lead  to  a  smaller  state 
space  while  still  allowing  an  exact  solution.  For  example,  in  our  system  with  K  processors,  we  could  model 
each  of  them  as  an  independent  subsystem  which  can  be  in  one  of  two  states,  up  or  down.  The  entire  system 
can then be viewed as composed of K such subsystems, thus having $2^K$ states. This approach is wasteful,
though,  since  it  is  not  necessary  to  distinguish  between  processors,  if  they  all  have  the  same  failure  and 
repair  behavior.  Rather,  we  can  represent  the  state  of  the  system  as  the  number  of  subsystems  in  each  state 
(up  or  down,  but,  since  the  total  number  of  processors  is  known,  we  can  simply  remember  the  number  of  up 
processors).

This  application  of  lumping  [30,  53]  is  indeed  so  natural  that  we  used  it  in  conjunction  with  truncation,
without  even  justifying  its  adoption.  In  real  systems,  though,  the  reachability  graph  of  a  subsystem  might 
be  quite  complex.  The  general  algorithm  to  obtain  the  lumped  state  space  for  a  system  consisting  of  K 
independent  subsystems  can  be  easily  expressed  making  use  of  SPNs  [13]  (see  also  [29]  for  an  example  of  use 
of  this  algorithm): 

1.  Generate the reachability graph for a single subsystem. Markings and arcs are labeled with the number
of tokens in each place and the name of the corresponding transition, respectively.

2.  Transform the reachability graph into an SPN: for each marking $i$, add a place $p_i$, initially empty; for
each arc from marking $i$ to marking $j$ labeled by transition $t$, add a transition $t_i$ with marking-dependent
rate equal to $\#(p_i)$ times the rate of $t$ in marking $i$ for a single subsystem, an input arc from $p_i$ to
$t_i$, and an output arc from $t_i$ to $p_j$ ($\#(p_i)$ is the number of tokens in place $p_i$).

3.  Set the initial marking of the SPN: for each subsystem, if its initial state is $i$, add a token in $p_i$. Note
that the subsystems can start in different initial states without affecting the correctness of lumping.

4.  Generate the CTMC underlying this SPN.
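
The four steps above can be sketched in a few lines for the case of identical, independent subsystems. This is an illustration under stated assumptions, not the authors' implementation: the single-subsystem reachability graph is passed directly as a list of dictionaries (`sub_rates`, a hypothetical name), a lumped state is the tuple of token counts in the places $p_i$, and the example rates are invented.

```python
from collections import deque

def lumped_ctmc(sub_rates, initial_counts):
    """Build the lumped CTMC for identical, independent subsystems.

    sub_rates[i] is a dict {j: rate} describing the single-subsystem graph;
    initial_counts[i] is the number of subsystems that start in state i.
    Returns the set of reachable lumped states and a dict of transition rates.
    """
    start = tuple(initial_counts)
    seen = {start}
    rates = {}
    frontier = deque([start])
    while frontier:
        marking = frontier.popleft()
        for i, count in enumerate(marking):
            if count == 0:
                continue
            for j, rate in sub_rates[i].items():
                nxt = list(marking)
                nxt[i] -= 1          # one token leaves place p_i ...
                nxt[j] += 1          # ... and enters place p_j
                nxt = tuple(nxt)
                # Marking-dependent rate: #(p_i) times the single-subsystem rate.
                rates[(marking, nxt)] = rates.get((marking, nxt), 0.0) + count * rate
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
    return seen, rates

# Two-state (up/down) subsystem with hypothetical rates, K = 5 copies, all up.
lam, mu = 0.01, 1.0
states, rates = lumped_ctmc([{1: lam}, {0: mu}], (5, 0))

# A hypothetical four-state cyclic subsystem, K = 3 copies, all in state 0.
cyc_states, _ = lumped_ctmc([{1: 1.0}, {2: 1.0}, {3: 1.0}, {0: 1.0}], (3, 0, 0, 0))
```

For the two-state subsystem the lumped chain is the familiar birth-death process on the number of working processors; the same function handles subsystems whose own reachability graph is more complex.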


Figure 14 shows the application of the algorithm to a system composed of K dual-redundant subsystems,
where repair is initiated only when both units have failed. Each subsystem is described by an SPN whose
reachability graph has four markings. If no lumping is applied, the total number of states is $4^K$. The
application of our algorithm, instead, results in an SPN with $(K+3)(K+2)(K+1)/6$ states.

In general, if there are K subsystems with N states each, the size of the state space without and with
lumping is

\[ N^K = \underbrace{N \times \cdots \times N}_{K \text{ terms}} \quad \text{vs.} \quad \binom{N+K-1}{K} = \underbrace{\frac{N+K-1}{K} \cdot \frac{N+K-2}{K-1} \cdots \frac{N+1}{2} \cdot \frac{N}{1}}_{K \text{ terms}} . \]

Each of the K terms in the second case is smaller than N, with the exception of the last one, which is N,
so this approach is always guaranteed to reduce the size of the state space. The reduction is particularly
sizable when N is small and K is large: for example, when N = 2 we have $2^K$ vs. $K + 1$.
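
As a quick arithmetic check of the two state-space sizes, the following sketch evaluates both expressions; the function names are illustrative only.

```python
from math import comb

def unlumped_size(n_states, k_subsystems):
    """Number of states when every subsystem is tracked individually: N^K."""
    return n_states ** k_subsystems

def lumped_size(n_states, k_subsystems):
    """Number of ways to distribute K identical subsystems over N states."""
    return comb(n_states + k_subsystems - 1, k_subsystems)

# K = 10 dual-redundant subsystems with N = 4 markings each.
without_lumping = unlumped_size(4, 10)
with_lumping = lumped_size(4, 10)
```

For N = 4 and K = 10 this gives 1,048,576 states without lumping against 286 with lumping, in agreement with $(K+3)(K+2)(K+1)/6$.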

In  practice,  the  submodels  have  some  interaction,  so  independence  does  not  hold.  If  the  interaction  is 
limited  to  a  “rate  dependence”  [14]  where  the  transition  rates  in  a  subsystem  depend  on  the  number  of 
subsystems  in  certain  states,  but  not  on  their  identity,  the  algorithm  can  still  be  applied:  only  a  different 
specification  of  the  firing  rates  for  the  resulting  SPN  is  needed.  In  our  example,  the  repairperson  could  be 
a shared resource, so the rate of transition repair in each subsystem could be $\mu / f^{1.1}$, where $f$ is the
total number of subsystems being repaired, and the exponent 1.1 models the inherent inefficiency due to
resource sharing. The rate of transition $repair_{1010}$ in the resulting SPN should then be specified as
$\mu / \#(p_{1010})^{0.1}$, where $\#(p_{1010})$ indicates the number of tokens in $p_{1010}$ or, in other words, $f$.

Other  types  of  dependence  are  structural:  often,  tokens  might  have  to  move  from  a  submodel  to  another 
portion  of  the  global  model.  With  some  care,  lumping  might  still  be  possible  [57]. 




Composition  In  this  approach,  the  overall  model  is  composed  of  a  set  of  submodels.  Construction  and 
generation  of  a  large  model  is  avoided  and  the  solution  is  obtained  by  interactions  among  the  submodels. 
Interactions  imply  exchange  of  information  between  the  submodels.  Reward  based  performability  analysis 
[44,  63]  is  an  example  of  composition  of  reliability  and  performance  models.  The  performance  submodel  is 
solved  and  its  results  are  passed  as  reward  rates  to  the  reliability  submodel.  In  general,  quantities  such  as 
probability  distributions,  mean,  variance,  or  numerical  values  of  reliability  and  availability  are  exchanged 
among  submodels. 

Other examples of composition include the flow-equivalent server approximation introduced by Chandy et
al. [10], behavioral decomposition used in the software tool HARP [21], composition of GSPNs and queueing
networks proposed by Balbo et al. [3], and hybrid hierarchical composition employed in the software tool
SHARPE [56]. These approaches can be classified as hierarchical composition techniques. Hierarchical
composition approaches differ not only in the way the model is constructed but also in the way the model
is  solved.  The  set  of  submodels  can  be  solved  iteratively  using  a  fixed-point  iteration  scheme  (a  cyclic 
dependence  exists  among  the  submodels)  [12, 14,  47,  61]  or  in  a  non-iterative  fashion  (a  strict  hierarchy  exists 
among  the  submodels)  [43,  56].  For  a  unified  view  of  these  seemingly  different  approaches  to  hierarchical 
composition,  refer  to  [42]. 
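
A minimal sketch of the non-iterative, reward-based composition described above follows. The specific submodels are assumptions chosen for brevity and are not taken from the cited tools: the performance submodel is an M/M/n/n loss system solved with the Erlang-B recursion, the dependability submodel assumes K independent processors each with its own repair facility, and all numerical rates are hypothetical. The throughput with n working processors is passed upward as the reward rate of the corresponding availability state.

```python
from math import comb

def erlang_b(servers, offered_load):
    """Blocking probability of an M/M/n/n loss system (Erlang-B recursion)."""
    b = 1.0
    for n in range(1, servers + 1):
        b = offered_load * b / (n + offered_load * b)
    return b

def throughput(servers, arrival_rate, service_rate):
    """Performance submodel: accepted-job throughput with `servers` processors up."""
    return arrival_rate * (1.0 - erlang_b(servers, arrival_rate / service_rate))

def performability(K, lam, mu, arrival_rate, service_rate):
    """Dependability submodel weighted by rewards from the performance submodel."""
    a = mu / (lam + mu)                 # steady-state availability of one processor
    total = 0.0
    for n in range(K + 1):
        prob = comb(K, n) * a ** n * (1.0 - a) ** (K - n)   # P[n processors up]
        total += prob * throughput(n, arrival_rate, service_rate)
    return total

expected_throughput = performability(K=4, lam=0.001, mu=0.1,
                                     arrival_rate=2.0, service_rate=1.0)
```

The two submodels are solved separately and never combined into a single CTMC, which is where both the largeness and the stiffness of the monolithic model are avoided. The approximation relies on the performance submodel reaching steady state quickly relative to the time between failures and repairs.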

Fluid  Models  As  the  number  of  tokens  in  a  place  or  the  number  of  jobs  in  a  queue  becomes  large,  the  size 
of  the  underlying  CTMC  grows.  It  may  be  possible  to  approximate  the  number  of  tokens  in  the  place,  or 
the  number  of  jobs  in  the  queue,  as  a  non-negative  real  number.  It  is  then  possible  to  write  the  differential 
equations for the dynamic behavior of the model and, in some cases, provide a solution. Mitra has developed
models along these lines [46]. More recently, Kulkarni and Trivedi have proposed fluid stochastic Petri nets
(FSPNs) [33].

5.2  Stiffness


CTMC  stiffness  is  a  computational  problem  which  adversely  affects  the  stability,  accuracy,  and  efficiency  of 
a  numerical  solution  method  unless  that  method  has  been  specially  designed  to  handle  it.  CTMC  stiffness 
is  caused  by  extreme  disparity  between  transition  rates.  In  a  reliability  model,  repair  rates  could  be  10^ 
times  the  failure  rates.  In  a  monolithic  performability  model,  the  job  arrival  rates  could  be  10^  times  the 
component  failure  rates.  In  this  section,  we  discuss  how  stiffness  can  be  overcome.  To  begin  with,  we  describe 
how  the  extreme  disparity  between  transition  rates  translates  into  a  computational  problem  for  numerical 
solution  methods. 

Let us consider the linear system of differential equations in Equation 2. This system is considered stiff
if the solution has components whose rates of change (decay or gain) differ greatly. The rate of change of
each solution component is governed by the magnitude of an eigenvalue of the generator matrix $Q$. This
system is considered stiff if, for $i = 2, \ldots, m$, $\mathrm{Re}(\lambda_i) < 0$ and

\[ \max_i |\mathrm{Re}(\lambda_i)| \gg \min_i |\mathrm{Re}(\lambda_i)| , \]

where $\lambda_i$ are the eigenvalues of $Q$. The rate of change of a solution component is defined relative to the
solution interval, hence Miranker [45] gave the following definition: “a system of differential equations is said
to be stiff in the interval $[0, t)$ if there exists a solution component of the system which has variation in that
interval that is large compared to $1/t$”. However, the CTMC attains numerical steady state at some finite
time $t_{ss}$: within the specified accuracy (or error tolerance) the state probability vector does not change with
increase in time. Hence we may redefine stiffness: “the system of differential equations in Equation 2 is
said to be stiff in the interval $[0, t)$ if there exists a solution component of the system which has variation in
that interval that is large compared to $1/\min\{t, t_{ss}\}$”. The large difference in transition rates of the CTMC
approximately translates into a large difference in magnitude of the eigenvalues of the generator matrix.

Stiffness  could  cause  numerical  instability  and  make  the  solution  methods  inefficient  if  the  methods  are 
not  designed  to  handle  stiffness.  Like  largeness,  two  basic  approaches  to  overcome  stiffness  are:  stiffness 
avoidance  or  stiffness  tolerance. 

5.2.1  Stiffness  Avoidance

According  to  this  approach,  stiffness  is  eliminated  from  a  model  by  applying  some  approximation  scheme. 
This  results  in  a  set  of  non-stiff  models  which  are  then  solved  to  obtain  the  overall  solution.  Bobbio  and 
Trivedi  [8]  have  designed  one  such  technique  based  on  aggregation.  Most  of  these  approaches  avoid  largeness 
as  well,  since  some  kind  of  model  decomposition  or  aggregation  is  involved. 

5.2.2  Stiffness  Tolerance

Special  solution  methods  that  are  designed  to  handle  stiffness  are  used  in  this  approach.  The  two  most 
commonly used methods for transient analysis of CTMCs are uniformization and numerical ODE solution
methods. It has been shown [54, 40] that uniformization is inefficient for stiff CTMCs. A modified
implementation of uniformization which incorporates steady-state detection of the underlying discrete-time
Markov chain (DTMC) [50] was shown to be more efficient than the standard implementation when the solution
interval was larger than the time needed to reach numerical steady state. However, uniformization remains
much less efficient than L-stable ODE methods [38]. L-stable ODE methods [34] are recommended for stiff
CTMCs. Among these, the second-order TR-BDF2 method [54] is efficient for low accuracy requirements and
the third-order implicit Runge-Kutta method [40] is efficient for high accuracy requirements. Recently, more
efficient methods based on stiffness detection [37] have been proposed.
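
Why L-stability matters can be seen on the smallest possible example. The sketch below is not one of the methods cited above: it uses backward Euler, the simplest L-stable method, as a stand-in, and compares it with forward Euler on a two-state (up/down) CTMC. The rates and step size are hypothetical, chosen so that the step is about ten times larger than the fastest time constant.

```python
import math

def exact_up(lam, mu, t):
    """Exact P[up at time t] for a two-state CTMC that starts in the up state."""
    s = lam + mu
    return mu / s + lam / s * math.exp(-s * t)

def forward_euler(lam, mu, t, steps):
    """Explicit Euler on dp/dt = mu - (lam + mu) p; unstable if h (lam + mu) > 2."""
    h, p = t / steps, 1.0
    for _ in range(steps):
        p += h * (mu - (lam + mu) * p)
    return p

def backward_euler(lam, mu, t, steps):
    """Implicit (backward) Euler on the same equation; stable for any step size."""
    h, p = t / steps, 1.0
    for _ in range(steps):
        p = (p + h * mu) / (1.0 + h * (lam + mu))
    return p

lam, mu = 1e-3, 1e3          # failure rate and repair rate differ by 10^6
t, steps = 1.0, 100          # step h = 0.01, so h (lam + mu) is about 10
explicit = forward_euler(lam, mu, t, steps)
implicit = backward_euler(lam, mu, t, steps)
reference = exact_up(lam, mu, t)
```

With this step size the explicit iterate multiplies its error by roughly -9 at every step and diverges, while the implicit iterate divides it by roughly 11 and settles on the steady-state value. An explicit method would need a step below 2/(lam + mu) over the whole interval, which is the inefficiency that stiffness-tolerant methods are designed to remove.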

5.3  Non-exponential  distributions 

5.3.1  Phase  Approximations

The  basic  methodology  of  phase  approximations  is  to  replace  a  non-exponential  distribution  in  a  model  by 
a  set  of  states  and  transitions  between  those  states  such  that  the  holding  time  in  each  state  is  exponentially 
distributed. This follows from Cox [18], who showed that any non-exponential probability distribution with
a rational Laplace-Stieltjes transform (LST) can be represented by a series of exponential stages with
complex-valued transition rates. Each stage is entered with some probability and exited (the process stops)
with the complementary probability. However, conditions to determine whether the resulting function is a
proper cdf are not known. To overcome this problem, Neuts [52] restricted the Coxian representation by
defining phase-type distributions as absorbing-time distributions of a CTMC with at least one absorbing
state. Non-exponential distributions can be approximated by phase-type distributions (also known as phase
approximations when used in this context). Distributions without rational LSTs can be approximated by
distributions having rational LSTs, although arbitrarily close approximations may require a CTMC with a
large state space.

A  complete  approach  to  phase  approximations  is  discussed  in  [39].  This  approach  consists  of  a  few  basic
steps: 

•  Selecting  a  phase  approximation  class  for  a  given  distribution.  One  of  the  most  commonly  used  phase 
approximation  classes  is  a  mixture  of  Erlang  distributions  [9].  It  has  been  used  in  [26,  39,  60]  and 
good  fits  to  some  commonly  occurring  distributions  such  as  Weibull,  deterministic,  lognormal,  and 
uniform  have  been  obtained.  Schmickler  [58]  has  used  mixtures  of  Erlang  distributions  to  fit  empirical 
functions.  Bobbio  et  al.  [6,  4,  5]  have  used  a  different  kind  of  acyclic  phase  approximation  and  obtained 
good  fits  to  several  distributions. 

•  Obtaining  the  parameters  of  phase  approximations.  Once  a  suitable  phase  approximation  has  been 
chosen  for  a  given  distribution  (which  may  be  in  empirical  form),  the  next  step  is  to  fit  the  parameters 
of  this  phase  approximation.  The  choices  include  moment  matching,  function  (cdf  or  pdf)  fitting, 
maximum likelihood estimation (in case of empirical distributions), or a combination of these [9, 39].
Johnson and Taaffe [27, 28] have considered matching the first three moments of mixtures of two Erlang
distributions. For more references on this topic, refer to [7].

•  Generation of the overall CTMC. After the parameters of phase approximations for all the
non-exponential distributions have been fitted (or estimated), the overall CTMC is generated. This may
require the cross-product of phase approximations [39].
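
To make the trade-off between accuracy and state-space size concrete, the sketch below fits the simplest member of the Erlang family to a deterministic delay by matching the mean. This is only an illustration, not the fitting procedure of [39]: a single Erlang-k distribution is used instead of a mixture, and the delay value and stage counts are hypothetical.

```python
import math

def erlang_for_mean(mean, k):
    """Stage rate and variance of an Erlang-k distribution with the given mean."""
    rate = k / mean
    variance = k / rate ** 2      # equals mean^2 / k
    return rate, variance

def erlang_cdf(k, rate, t):
    """P[T <= t] for the sum of k exponential stages, each with rate `rate`."""
    term = math.exp(-rate * t)    # Poisson(rate * t) probability of 0 events
    tail = term
    for n in range(1, k):
        term *= rate * t / n
        tail += term              # P[fewer than k stage completions by time t]
    return 1.0 - tail

delay = 1.0                       # deterministic delay to be approximated
for stages in (1, 10, 100):
    rate, variance = erlang_for_mean(delay, stages)
    spread = (erlang_cdf(stages, rate, 0.5 * delay),
              erlang_cdf(stages, rate, 1.5 * delay))
```

The variance of the fit shrinks only as 1/k, so a sharp approximation of a deterministic time needs many stages, and every stage multiplies the state space of the overall CTMC.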

A  few  software  packages  implementing  this  approach  have  been  developed.  Phase  approximations  were 
used  in  the  SURF  package  [16],  although  SURF  was  intended  only  for  a  restricted  class  of  reliability  models. 
Cumani  [19]  has  designed  the  software  package  ESP  for  evaluation  of  SPNs  with  phase-type  distributed  firing 
times.  Phase  approximations  for  a  class  of  non-Markovian  models  have  been  implemented  in  GSHARPE 
[39].  GSHARPE  is  a  front  end  for  a  general  purpose  performance  and  reliability  modeling  toolkit  called 
SHARPE  [56].  It  accepts  a  non-Markovian  model  and  converts  it  into  a  CTMC  in  SHARPE  syntax  after 
applying  phase  approximations. 

5.3.2  Non-homogeneous  CTMCs 

If  transition  rates  in  a  CTMC  are  allowed  to  be  time-dependent,  where  time  is  measured  from  the  beginning  of 
system  operation,  the  model  becomes  a  non-homogeneous  CTMC.  Such  models  are  used  in  software  reliability 
under  the  name  of  NHPP  (Non-Homogeneous  Poisson  Process)  [51]  and  in  hardware  reliability  models  of
non-repairable  systems  [22].  Tools  such  as  CARE  III  and  HARP  allow  component  failure  distributions  to 
be  Weibull  using  this  approach. 
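
As a small illustration of a globally time-dependent rate, the sketch below treats a single non-repairable component with a Weibull time to failure as a two-state non-homogeneous CTMC and integrates its Kolmogorov equation numerically. It does not reproduce how CARE III or HARP work internally; the classical fourth-order Runge-Kutta integrator, the parameter values, and the restriction to shape parameters of at least 1 (so that the hazard rate is finite at time 0) are all assumptions of this sketch.

```python
import math

def weibull_hazard(t, shape, scale):
    """Time-dependent failure rate h(t) of a Weibull distribution (shape >= 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def reliability_numeric(t_end, shape, scale, steps=2000):
    """Integrate dP_up/dt = -h(t) P_up with classical RK4, starting from P_up(0) = 1."""
    h = t_end / steps
    p, t = 1.0, 0.0

    def f(time, prob):
        return -weibull_hazard(time, shape, scale) * prob

    for _ in range(steps):
        k1 = f(t, p)
        k2 = f(t + h / 2, p + h / 2 * k1)
        k3 = f(t + h / 2, p + h / 2 * k2)
        k4 = f(t + h, p + h * k3)
        p += h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
    return p

def reliability_exact(t, shape, scale):
    """Closed-form Weibull reliability, used here only as a cross-check."""
    return math.exp(-((t / scale) ** shape))

numeric = reliability_numeric(500.0, shape=2.0, scale=1000.0)
closed_form = reliability_exact(500.0, shape=2.0, scale=1000.0)
```

For a single component the closed form makes the numerical solution unnecessary; the point of the non-homogeneous CTMC formulation is that the same integration applies unchanged when several such components share one state space.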

5.3.3  Markov  regenerative  processes  (MRGPs) 

The use of non-homogeneous CTMCs allows transition rates to be globally time-dependent, while the use
of semi-Markov processes (SMPs) allows the time dependence to be local (measured from the entry into the
state). Both of these are often inadequate in practice. While, in principle, phase approximations allow more
general time dependence, their practical usefulness is limited by the increased size of the underlying
stochastic process, which further exacerbates the largeness problem. MRGPs seem to provide a useful
time dependence that can capture many interesting practical scenarios. The basic idea is that not every
state change is required to be a regeneration point. Thus, in a multi-component system with each component
having an exponential time-to-failure distribution and a generally distributed repair time with a single
repairperson (FCFS), the underlying stochastic process is an MRGP (but not an SMP or a CTMC). Recent work
on this topic can be found in [11, 15].

6  Conclusion


We  discussed  several  types  of  modeling  techniques  used  in  dependability  and  performability  analysis,  with 
a  particular  emphasis  on  approaches  based  on  the  (entire  or  partial)  generation  of  the  state-space. 

The  common  underlying  formalisms  we  consider,  continuous-time  Markov  chains  (CTMCs)  and  Markov 
reward  models  (MRMs),  are  capable  of  modeling  a  large  class  of  systems,  but  they  result  in  large  models, 
difficult  to  describe  and  analyze.  The  description  problem  is  solved  by  using  higher-level  formalisms,  such 
as  reliability  graphs,  fault  trees,  queueing  networks,  generalized  stochastic  Petri  nets,  and  stochastic  reward 
nets.  With  the  appropriate  software  modeling  tools,  these  can  then  be  automatically  translated  into  CTMCs 
or  MRMs. 

The solution problem, though, remains, since the size of the underlying stochastic process grows
combinatorially. In addition, when modeling activities with very different time-scales, such as failure and
repair of components, and performance-related behavior, such as arrival and departure of jobs, stiffness
arises. Advanced numerical techniques, and exact or approximate approaches such as truncation, aggregation,
composition, and fluid models, can then be effectively used to obtain numerical solutions.